Abstract — Feature clustering is a powerful method to reduce the dimensionality of feature vectors for text 
classification. In this paper, we propose a fuzzy similarity-based self-constructing algorithm for feature 
clustering. The words in the feature vector of a document set are grouped into clusters, based on a similarity test. 
Words that are similar to each other are grouped into the same cluster. Each cluster is characterized by a 
membership function with statistical mean and deviation. When all the words have been fed in, a desired 
number of clusters are formed automatically. We then have one extracted feature for each cluster. The extracted 
feature, corresponding to a cluster, is a weighted combination of the words contained in the cluster. By this 
algorithm, the derived membership functions match closely with and describe properly the real distribution of 
the training data. Besides, the user need not specify the number of extracted features in advance, and 
trial-and-error for determining the appropriate number of extracted features can then be avoided. 

Index Terms — Fuzzy similarity, feature clustering, feature extraction, feature reduction, text classification 

I. INTRODUCTION 

In text classification, the dimensionality of the feature vector is usually huge. For example, 20 
Newsgroups [1] and Reuters21578 top-10 [2], which are two real-world data sets, both have more than 15,000 
features. Such high dimensionality can be a severe obstacle for classification algorithms [3], [4]. To alleviate 
this difficulty, feature reduction approaches are applied before document classification tasks are performed [5]. 
Two major approaches, feature selection [6], [7], [8], [9], [10] and feature extraction [11], [12], [13], have been 
proposed for feature reduction. In general, feature extraction approaches are more effective than feature 
selection techniques, but are more computationally expensive [11], [12], [14]. Therefore, scalable 
and efficient feature extraction algorithms are in high demand for dealing with high-dimensional document data 
sets. 

Classical feature extraction methods aim to convert the representation of the original high-dimensional 
data set into a lower-dimensional data set by a projecting process through algebraic transformations. For 
example, Principal Component Analysis [15], Linear Discriminant Analysis [16], Maximum Margin Criterion 
[12], and Orthogonal Centroid algorithm [17] perform the projection by linear transformations, while Locally 
Linear Embedding [18], ISOMAP [19], and Laplacian Eigenmaps [20] do feature extraction by nonlinear 
transformations. In practice, linear algorithms are in wider use due to their efficiency. Several scalable online 
linear feature extraction algorithms [14], [21], [22], [23] have been proposed to improve the computational 
complexity. However, the complexity of these approaches is still high. Feature clustering [24], [25], [26], [27], 
[28], [29] is one of the effective techniques for feature reduction in text classification. The idea of feature clustering 
is to group the original features into clusters with a high degree of pairwise semantic relatedness. Each cluster is 
treated as a single new feature, and, thus, feature dimensionality can be drastically reduced. 

The first feature extraction method based on feature clustering was proposed by Baker and McCallum 
[24], which was derived from the "distributional clustering" idea of Pereira et al. [30]. Al-Mubaid and Umair 
[31] used distributional clustering to generate an efficient representation of documents and applied a learning 
logic approach for training text classifiers. The Agglomerative Information Bottleneck approach was proposed 
by Tishby et al. [25], [29]. The divisive information-theoretic feature clustering algorithm was proposed by 
Dhillon et al. [27] and is reported to be more effective than other feature clustering methods. In these feature 
clustering methods, each new feature is generated by combining a subset of the original words. However, 
difficulties are associated with these methods. A word is assigned to exactly one subset, i.e., hard-clustering, 
based on the similarity magnitudes between the word and the existing subsets, even if the differences among 
these magnitudes are small. Also, the mean and the variance of a cluster are not considered when similarity 
with respect to the cluster is computed. Furthermore, these methods require the number of new features to be 
specified in advance by the user. 

We propose a fuzzy similarity-based self-constructing feature clustering algorithm, which is an incremental 
feature clustering approach to reduce the number of features for the text classification task. The words in the 
feature vector of a document set are represented as distributions, and processed one after another. Words that are 
similar to each other are grouped into the same cluster. Each cluster is characterized by a membership function 
with statistical mean and deviation. If a word is not similar to any existing cluster, a new cluster is created for 
this word. Similarity between a word and a cluster is defined by considering both the mean and the variance of 
the cluster. When all the words have been fed in, a desired number of clusters are formed automatically. We 
then have one extracted feature for each cluster. The extracted feature corresponding to a cluster is a weighted 
combination of the words contained in the cluster. Three ways of weighting, hard, soft, and mixed, are 
introduced. By this algorithm, the derived membership functions match closely with and describe properly the 
real distribution of the training data. Besides, the user need not specify the number of extracted features in 
advance, and trial-and-error for determining the appropriate number of extracted features can then be avoided. 
Experiments on real-world data sets show that our method can run faster and obtain better extracted features 
than other methods. 

II. BACKGROUND AND RELATED WORK 

To process documents, the bag-of-words model [32], [33] is commonly used. Let 
$D = \{d_1, d_2, \ldots, d_n\}$ be a document set of $n$ documents, where $d_1, d_2, \ldots, d_n$ are individual 
documents, and each document belongs to one of the classes in the set $\{c_1, c_2, \ldots, c_p\}$. If a document 
belongs to two or more classes, then two or more copies of the document with different classes are included in 
$D$. Let the word set $W = \{w_1, w_2, \ldots, w_m\}$ be the feature vector of the document set. Each document 
$d_i$, $1 \le i \le n$, is represented as $d_i = \langle d_{i1}, d_{i2}, \ldots, d_{im} \rangle$, where each $d_{ij}$ 
denotes the number of occurrences of $w_j$ in the $i$th document. The feature reduction task is to find a new 
word set $W' = \{w'_1, w'_2, \ldots, w'_k\}$, $k \ll m$, such that $W$ and $W'$ work equally well for all the 
desired properties with $D$. After feature reduction, each document $d_i$ is converted into a new representation 
$d'_i = \langle d'_{i1}, d'_{i2}, \ldots, d'_{ik} \rangle$ and the converted document set is 
$D' = \{d'_1, d'_2, \ldots, d'_n\}$. If $k$ is much smaller than $m$, the computation cost of subsequent 
operations on $D'$ can be drastically reduced. 

2.1 Feature Reduction 

In general, there are two ways of doing feature reduction: feature selection and feature extraction. By 
feature selection approaches, a new feature set $W' = \{w'_1, w'_2, \ldots, w'_k\}$ is obtained, which is a 
subset of the original feature set $W$. Then $W'$ is used as inputs for classification tasks. Information Gain 
(IG) is frequently employed in the feature selection approach [10]. It measures the reduced uncertainty by an 
information-theoretic measure and gives each word a weight. The weight of a word $w_j$ is calculated as follows: 

$$IG(w_j) = -\sum_{l=1}^{p} P(c_l)\log P(c_l) + P(w_j)\sum_{l=1}^{p} P(c_l \mid w_j)\log P(c_l \mid w_j) + P(\bar{w}_j)\sum_{l=1}^{p} P(c_l \mid \bar{w}_j)\log P(c_l \mid \bar{w}_j) \tag{1}$$

where $P(c_l)$ denotes the prior probability for class $c_l$, $P(w_j)$ denotes the prior probability for feature 
$w_j$, $P(\bar{w}_j)$ is identical to $1 - P(w_j)$, and $P(c_l \mid w_j)$ and $P(c_l \mid \bar{w}_j)$ denote the 
probability for class $c_l$ with the presence and absence, respectively, of $w_j$. The words of top $k$ weights 
in $W$ are selected as the features in $W'$. 
In feature extraction approaches, extracted features are obtained by a projecting process through algebraic 
transformations. 
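
As a concrete reading of (1), the following Python sketch computes the IG weight of a single word. It is an illustration only, under stated assumptions: probabilities are estimated from the presence or absence of the word in each document, the logarithm is taken to base 2 (the text does not state the base), 0 log 0 is treated as 0, and the function name and its inputs are hypothetical rather than taken from [10].

```python
import math

def information_gain(present, labels, num_classes):
    """IG weight of one word, following (1).

    present[i] is True if the word occurs in document i (assumption:
    document-level presence is used to estimate the probabilities).
    labels[i] is the class index of document i, in range(num_classes).
    """
    n = len(labels)

    def plogp_sum(doc_indices):
        # Returns sum_l P(c_l | subset) * log2 P(c_l | subset), with 0 log 0 = 0.
        if not doc_indices:
            return 0.0
        total = 0.0
        for c in range(num_classes):
            prob = sum(1 for i in doc_indices if labels[i] == c) / len(doc_indices)
            if prob > 0.0:
                total += prob * math.log2(prob)
        return total

    all_docs = list(range(n))
    with_word = [i for i in all_docs if present[i]]
    without_word = [i for i in all_docs if not present[i]]

    p_w = len(with_word) / n
    return (-plogp_sum(all_docs)
            + p_w * plogp_sum(with_word)
            + (1.0 - p_w) * plogp_sum(without_word))


# Hypothetical toy data: four documents, two classes.
labels = [0, 0, 1, 1]
perfect = information_gain([True, True, False, False], labels, 2)  # word marks class 0
useless = information_gain([True, False, True, False], labels, 2)  # word is uninformative
```

In this toy setting, a word that occurs in exactly the documents of one class receives the largest weight, while a word spread evenly over both classes receives a weight of zero.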

An incremental orthogonal centroid (IOC) algorithm was proposed in [14]. Let a corpus of documents 
be represented as an $m \times n$ matrix $X \in R^{m \times n}$, where $m$ is the number of features in the 
feature set and $n$ is the number of documents in the document set. IOC tries to find an optimal transformation 
matrix $F^* \in R^{m \times k}$, where $k$ is the desired number of extracted features, according to the 
following criterion: 

$$F^* = \arg\max \operatorname{trace}(F^T S_b F) \tag{2}$$

where $F \in R^{m \times k}$ and $F^T F = I$, and 

$$S_b = \sum_{q=1}^{p} P(c_q)(M_q - M_{all})(M_q - M_{all})^T \tag{3}$$

with $P(c_q)$ being the prior probability for a pattern belonging to class $c_q$, $M_q$ being the mean vector of 
class $c_q$, and $M_{all}$ being the mean vector of all patterns. 

2.2 Feature Clustering 

Feature clustering is an efficient approach for feature reduction [25], [29], which groups all features 
into some clusters, where features in a cluster are similar to each other. The feature clustering methods proposed 
in [24], [25], [27], [29] are "hard" clustering methods, where each word of the original features belongs to 
exactly one word cluster. Therefore each word contributes to the synthesis of only one new feature. Each new 
feature is obtained by summing up the words belonging to one cluster. Let $D$ be the matrix consisting of all the 
original documents with $m$ features and $D'$ be the matrix consisting of the converted documents with $k$ new 
features. The new feature set $W' = \{w'_1, w'_2, \ldots, w'_k\}$ corresponds to a partition 
$\{W_1, W_2, \ldots, W_k\}$ of the original feature set $W$, i.e., $W_t \cap W_q = \emptyset$, where 
$1 \le q, t \le k$ and $t \ne q$. Note that a cluster corresponds to an element in the partition. Then, the $t$th 
feature value of the converted document $d'_i$ is calculated as follows: 

$$d'_{it} = \sum_{w_j \in W_t} d_{ij} \tag{4}$$

which is a linear sum of the feature values in $W_t$. The divisive information-theoretic feature clustering 
(DC) algorithm, proposed by Dhillon et al. [27], calculates the distributions of words over classes, 
$P(C \mid w_j)$, $1 \le j \le m$, where $C = \{c_1, c_2, \ldots, c_p\}$, and uses Kullback-Leibler divergence to 
measure the dissimilarity between two distributions. 
The distribution of a cluster $W_t$ is calculated as follows: 

$$P(C \mid W_t) = \sum_{w_j \in W_t} \frac{P(w_j)}{\sum_{w_j \in W_t} P(w_j)} P(C \mid w_j) \tag{5}$$

The goal of DC is to minimize the following objective function: 

$$\sum_{t=1}^{k} \sum_{w_j \in W_t} P(w_j)\, KL\big(P(C \mid w_j), P(C \mid W_t)\big) \tag{6}$$

which takes the sum over all the $k$ clusters, where $k$ is specified by the user in advance. 
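
To make (5) and (6) concrete, the sketch below evaluates the DC objective for a given partition of the words. It is a minimal illustration, not the algorithm of [27] itself (which also searches for the partition); the function names, the base-2 logarithm, and the toy distributions are assumptions made for the example.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p, q) in bits, with 0 log 0 = 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def cluster_distribution(word_priors, word_dists, cluster):
    """P(C | W_t) as in (5): prior-weighted average of P(C | w_j) over the cluster."""
    total = sum(word_priors[j] for j in cluster)
    num_classes = len(word_dists[0])
    return [sum(word_priors[j] * word_dists[j][c] for j in cluster) / total
            for c in range(num_classes)]

def dc_objective(word_priors, word_dists, partition):
    """Objective (6): sum over clusters of P(w_j) * KL(P(C|w_j), P(C|W_t))."""
    value = 0.0
    for cluster in partition:
        centre = cluster_distribution(word_priors, word_dists, cluster)
        for j in cluster:
            value += word_priors[j] * kl_divergence(word_dists[j], centre)
    return value

# Hypothetical toy data: four words, two classes.
priors = [0.25, 0.25, 0.25, 0.25]
dists = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
good = dc_objective(priors, dists, [[0, 1], [2, 3]])   # similar words together
bad = dc_objective(priors, dists, [[0, 2], [1, 3]])    # dissimilar words together
```

Grouping words with similar class distributions gives a smaller objective value than grouping dissimilar ones, which is the behaviour the minimization in (6) rewards.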

III. OUR METHOD 

There are some issues pertinent to most of the existing feature clustering methods. First, the parameter 
k, indicating the desired number of extracted features, has to be specified in advance. This places a burden on the 
user, since trial-and-error has to be done until the appropriate number of extracted features is found. Second, 
when calculating similarities, the variance of the underlying cluster is not considered. Intuitively, the 
distribution of the data in a cluster is an important factor in the calculation of similarity. Third, all words in a 
cluster have the same degree of contribution to the resulting extracted feature. Sometimes, it may be better if 
more similar words are allowed to have bigger degrees of contribution. Our feature clustering algorithm is 
proposed to deal with these issues. 

Suppose we are given a document set $D$ of $n$ documents $d_1, d_2, \ldots, d_n$, together with the feature 
vector $W$ of $m$ words $w_1, w_2, \ldots, w_m$ and $p$ classes $c_1, c_2, \ldots, c_p$, as specified in Section 2. 
We construct one word pattern for each word in $W$. For word $w_i$, its word pattern $x_i$ is defined, similarly 
as in [27], by 

$$x_i = \langle P(c_1 \mid w_i), P(c_2 \mid w_i), \ldots, P(c_p \mid w_i) \rangle \tag{7}$$

where 

$$P(c_j \mid w_i) = \frac{\sum_{q=1}^{n} d_{qi} \times \delta_{qj}}{\sum_{q=1}^{n} d_{qi}} \tag{8}$$

for $1 \le j \le p$. Note that $d_{qi}$ indicates the number of occurrences of $w_i$ in document $d_q$, as 
described in Section 2. Also, $\delta_{qj}$ is defined as 

$$\delta_{qj} = \begin{cases} 1, & \text{if document } d_q \text{ belongs to class } c_j, \\ 0, & \text{otherwise.} \end{cases} \tag{9}$$
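
The construction in (7)-(9) can be sketched in a few lines of Python. The function name, the list-of-lists layout of the count matrix, and the toy counts are assumptions for illustration; the treatment of a word that never occurs (an all-zero pattern) is also an assumption, since (8) is undefined in that case.

```python
def word_patterns(counts, doc_classes, num_classes):
    """Word patterns x_i = <P(c_1|w_i), ..., P(c_p|w_i)> following (7)-(9).

    counts[q][i] is d_qi, the number of occurrences of word i in document q.
    doc_classes[q] is the class index of document q (it plays the role of delta).
    """
    num_words = len(counts[0])
    patterns = []
    for i in range(num_words):
        total = sum(row[i] for row in counts)          # denominator of (8)
        per_class = [0] * num_classes
        for row, cls in zip(counts, doc_classes):
            per_class[cls] += row[i]                   # numerator of (8), per class
        # Assumption: a word that never occurs gets an all-zero pattern.
        patterns.append([c / total if total else 0.0 for c in per_class])
    return patterns

# Hypothetical counts: 3 documents x 2 words; the first two documents are in class 0.
counts = [[2, 0],
          [1, 1],
          [1, 3]]
patterns = word_patterns(counts, [0, 0, 1], 2)   # [[0.75, 0.25], [0.25, 0.75]]
```

Each pattern sums to 1 over the classes, so a word pattern can be read as the distribution of a word's occurrences over the classes.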



Therefore, we have $m$ word patterns in total. It is these word patterns that our clustering algorithm will 
work on. Our goal is to group the words in $W$ into clusters, based on these word patterns. A cluster contains a 
certain number of word patterns, and is characterized by the product of $p$ one-dimensional Gaussian functions. 
Gaussian functions are adopted because of their superiority over other functions in performance [34], [35]. Let 
$G$ be a cluster containing $q$ word patterns $x_1, x_2, \ldots, x_q$. Let 
$x_j = \langle x_{j1}, x_{j2}, \ldots, x_{jp} \rangle$, $1 \le j \le q$. Then the mean 
$m = \langle m_1, m_2, \ldots, m_p \rangle$ and the deviation 
$\sigma = \langle \sigma_1, \sigma_2, \ldots, \sigma_p \rangle$ of $G$ are defined as 

$$m_i = \frac{\sum_{j=1}^{q} x_{ji}}{|G|} \tag{10}$$

$$\sigma_i = \sqrt{\frac{\sum_{j=1}^{q} (x_{ji} - m_i)^2}{|G|}} \tag{11}$$

for $1 \le i \le p$, where $|G|$ denotes the size of $G$, i.e., the number of word patterns contained in $G$. The 
fuzzy similarity of a word pattern $x = \langle x_1, x_2, \ldots, x_p \rangle$ to cluster $G$ is defined by the 
following membership function: 

$$\mu_G(x) = \prod_{i=1}^{p} \exp\left[-\left(\frac{x_i - m_i}{\sigma_i}\right)^2\right] \tag{12}$$

Notice that $0 \le \mu_G(x) \le 1$. A word pattern close to the mean of a cluster is regarded to be very similar 
to this cluster, i.e., $\mu_G(x) \approx 1$. On the contrary, a word pattern far distant from a cluster is hardly 
similar to this cluster, i.e., $\mu_G(x) \approx 0$. 
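
The membership function (12) is short enough to write out directly. The following Python sketch is illustrative; the function name is hypothetical and the cluster parameters (mean <1, 0>, deviation <0.5, 0.5>) are chosen only as an example.

```python
import math

def membership(x, mean, dev):
    """Fuzzy similarity of pattern x to a cluster with the given mean and deviation, as in (12)."""
    value = 1.0
    for x_i, m_i, s_i in zip(x, mean, dev):
        value *= math.exp(-((x_i - m_i) / s_i) ** 2)
    return value

# Illustrative cluster: mean <1, 0>, deviation <0.5, 0.5>.
at_mean = membership([1.0, 0.0], [1.0, 0.0], [0.5, 0.5])     # exactly 1.0
halfway = membership([0.5, 0.5], [1.0, 0.0], [0.5, 0.5])     # exp(-2), about 0.1353
far_away = membership([0.0, 1.0], [1.0, 0.0], [0.5, 0.5])    # exp(-8), about 0.0003
```

The three calls show the behaviour described above: the value is 1 at the cluster mean and decays quickly as the pattern moves away from it, at a rate controlled by the deviation.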

3.1 Self-Constructing Clustering 

Our clustering algorithm is an incremental, self-constructing learning approach. Word patterns are 
considered one by one. The user does not need to have any idea about the number of clusters in advance. No 
clusters exist at the beginning, and clusters can be created if necessary. For each word pattern, the similarity of 
this word pattern to each existing cluster is calculated to decide whether it is combined into an existing cluster or 
a new cluster is created. Once a new cluster is created, the corresponding membership function should be 
initialized. On the contrary, when the word pattern is combined into an existing cluster, the membership 
function of that cluster should be updated accordingly. 

Let $k$ be the number of currently existing clusters. The clusters are $G_1, G_2, \ldots, G_k$, respectively. Each 
cluster $G_j$ has mean $m_j = \langle m_{j1}, m_{j2}, \ldots, m_{jp} \rangle$ and deviation 
$\sigma_j = \langle \sigma_{j1}, \sigma_{j2}, \ldots, \sigma_{jp} \rangle$. Let $S_j$ be the size of cluster $G_j$. 
Initially, we have $k = 0$. So, no clusters exist at the beginning. For each word pattern 
$x_i = \langle x_{i1}, x_{i2}, \ldots, x_{ip} \rangle$, $1 \le i \le m$, we calculate, according to (13), the 
similarity of $x_i$ to each existing cluster, i.e., 

$$\mu_{G_j}(x_i) = \prod_{q=1}^{p} \exp\left[-\left(\frac{x_{iq} - m_{jq}}{\sigma_{jq}}\right)^2\right] \tag{13}$$

for $1 \le j \le k$. We say that $x_i$ passes the similarity test on cluster $G_j$ if 

$$\mu_{G_j}(x_i) \ge \rho \tag{14}$$

where $\rho$, $0 \le \rho \le 1$, is a predefined threshold. If the user intends to have larger clusters, then a 
smaller threshold can be given. Otherwise, a bigger threshold can be given. As the threshold increases, the number 
of clusters also increases. Note that, as usual, the power in (13) is 2 [34], [35]. Its value has an effect on the 
number of clusters obtained. A larger value will make the boundaries of the Gaussian function sharper, and 
more clusters will be obtained for a given threshold. On the contrary, a smaller value will make the boundaries 
of the Gaussian function smoother, and fewer clusters will be obtained instead. 

Two cases may occur. First, there are no existing fuzzy clusters on which $x_i$ has passed the similarity 
test. For this case, we assume that $x_i$ is not similar enough to any existing cluster and a new cluster $G_h$, 
$h = k + 1$, is created with 

$$m_h = x_i, \quad \sigma_h = \sigma_0 \tag{15}$$

where $\sigma_0 = \langle \sigma_0, \ldots, \sigma_0 \rangle$ is a user-defined constant vector. Note that the new 
cluster $G_h$ contains only one member, the word pattern $x_i$, at this point. Estimating the deviation of a 
cluster by (11) is impossible, or inaccurate, if the cluster contains few members. In particular, the deviation of 
a new cluster is 0, since it contains only one member. We cannot use zero deviation in the calculation of fuzzy 
similarities. Therefore, we initialize the deviation of a newly created cluster by $\sigma_0$, as indicated in 
(15). Of course, the number of clusters is increased by 1 and the size of cluster $G_h$, $S_h$, should be 
initialized, i.e., 

$$k = k + 1, \quad S_h = 1 \tag{16}$$

Second, if there are existing clusters on which $x_i$ has passed the similarity test, let cluster $G_t$ be the 
cluster with the largest membership degree, i.e., 

$$t = \arg\max_{1 \le j \le k} \mu_{G_j}(x_i) \tag{17}$$

In this case, we regard $x_i$ to be most similar to cluster $G_t$, and $m_t$ and $\sigma_t$ of cluster $G_t$ should 
be modified to include $x_i$ as its member. The modification to cluster $G_t$ is described as follows: 

$$m_{tj} = \frac{S_t \times m_{tj} + x_{ij}}{S_t + 1} \tag{18}$$

$$\sigma_{tj} = \sqrt{A - B} + \sigma_0 \tag{19}$$

$$A = \frac{(S_t - 1)(\sigma_{tj} - \sigma_0)^2 + S_t \times m_{tj}^2 + x_{ij}^2}{S_t} \tag{20}$$

$$B = \frac{S_t + 1}{S_t} \left(\frac{S_t \times m_{tj} + x_{ij}}{S_t + 1}\right)^2 \tag{21}$$

for $1 \le j \le p$, and 

$$S_t = S_t + 1 \tag{22}$$

On the right-hand sides of (18)-(21), $m_{tj}$, $\sigma_{tj}$, and $S_t$ denote the values before the update. 
Equations (18) and (19) can be derived easily from (10) and (11). Note that $k$ is not changed in this case. 
The whole clustering algorithm can be summarized below. 
The whole clustering algorithm can be summarized below. 

Initialization: 
    # of original word patterns: $m$ 
    # of classes: $p$ 
    Threshold: $\rho$ 
    Initial deviation: $\sigma_0$ 
    Initial # of clusters: $k = 0$ 
Input: 
    $x_i = \langle x_{i1}, x_{i2}, \ldots, x_{ip} \rangle$, $1 \le i \le m$ 
Output: 
    Clusters $G_1, G_2, \ldots, G_k$ 
procedure Self-Constructing-Clustering-Algorithm 
    for each word pattern $x_i$, $1 \le i \le m$ 
        temp_W = $\{G_j \mid \mu_{G_j}(x_i) \ge \rho,\ 1 \le j \le k\}$; 
        if (temp_W == $\emptyset$) 
            A new cluster $G_h$, $h = k + 1$, is created by (15)-(16); 
        else 
            let $G_t \in$ temp_W be the cluster to which $x_i$ is closest by (17); 
            Incorporate $x_i$ into $G_t$ by (18)-(22); 
        endif; 
    endfor; 
    return with the created $k$ clusters; 
endprocedure 
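
The procedure above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function names and the dictionary layout of a cluster are hypothetical, the membership is computed as the exponential of a sum (equivalent to the product in (13)), and the `max(..., 0.0)` guard against rounding error is an addition not found in the text. The usage lines feed in the ten word patterns of Table 2 in table order, with $x_8$ taken as <2/3, 1/3> (an assumption; the table prints the rounded values 0.67 and 0.33).

```python
import math

def membership(x, mean, dev):
    """Fuzzy similarity (13) of pattern x to one cluster."""
    return math.exp(-sum(((a - m) / s) ** 2 for a, m, s in zip(x, mean, dev)))

def self_constructing_clustering(patterns, rho, sigma0):
    """Group word patterns into clusters; returns dicts with size, mean, and dev."""
    clusters = []
    for x in patterns:
        scores = [membership(x, c["mean"], c["dev"]) for c in clusters]
        passed = [j for j, s in enumerate(scores) if s >= rho]   # similarity test (14)
        if not passed:
            # (15)-(16): new cluster centred on x with the initial deviation.
            clusters.append({"size": 1, "mean": list(x), "dev": [sigma0] * len(x)})
            continue
        t = max(passed, key=lambda j: scores[j])                 # (17)
        c = clusters[t]
        s_t = c["size"]
        for j, x_j in enumerate(x):
            m_old, d_old = c["mean"][j], c["dev"][j]             # values before the update
            a = ((s_t - 1) * (d_old - sigma0) ** 2
                 + s_t * m_old ** 2 + x_j ** 2) / s_t            # (20)
            b = (s_t + 1) / s_t * ((s_t * m_old + x_j) / (s_t + 1)) ** 2   # (21)
            c["mean"][j] = (s_t * m_old + x_j) / (s_t + 1)       # (18)
            # (19); max() guards against a tiny negative value caused by rounding.
            c["dev"][j] = math.sqrt(max(a - b, 0.0)) + sigma0
        c["size"] = s_t + 1                                      # (22)
    return clusters

# Word patterns of Table 2, in table order (x_8 assumed to be <2/3, 1/3>).
table2 = [[0.0, 1.0], [0.2, 0.8], [0.2, 0.8], [0.0, 1.0], [1.0, 0.0],
          [0.5, 0.5], [1.0, 0.0], [2 / 3, 1 / 3], [0.0, 1.0], [1.0, 0.0]]
clusters = self_constructing_clustering(table2, rho=0.64, sigma0=0.5)
for c in clusters:
    print(c["size"], [round(v, 4) for v in c["mean"]], [round(v, 4) for v in c["dev"]])
```

With these inputs the sketch produces three clusters of sizes 5, 3, and 2 whose rounded means and deviations agree with Table 3; only the order in which the clusters are created differs, since it depends on the order in which the patterns are fed in.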

Note that the word patterns in a cluster have a high degree of similarity to each other. Besides, when 
new training patterns are considered, the existing clusters can be adjusted or new clusters can be created, 
without the necessity of generating the whole set of clusters from scratch. 

Note that the order in which the word patterns are fed in influences the clusters obtained. We apply a 
heuristic to determine the order. We sort all the patterns, in decreasing order, by their largest components. Then 
the word patterns are fed in this order. In this way, more significant patterns will be fed in first and likely 
become the core of the underlying cluster. This heuristic seems to work well. 
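
The ordering heuristic amounts to a single sort. In the sketch below the function name and the sample patterns are hypothetical, and ties are kept in their original order, which is an assumption since the text does not say how ties are broken.

```python
def order_patterns(patterns):
    """Indices of the patterns sorted by their largest component, in decreasing order."""
    # sorted() is stable, also with reverse=True, so ties keep their original order.
    return sorted(range(len(patterns)), key=lambda i: max(patterns[i]), reverse=True)

patterns = [[0.5, 0.5], [0.0, 1.0], [0.2, 0.8], [1.0, 0.0]]
order = order_patterns(patterns)   # [1, 3, 2, 0]
```

Patterns concentrated on one class come first, and the evenly spread pattern <0.5, 0.5> is fed in last.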

We discuss briefly here the computational cost of our method and compare it with DC [27], IOC [14], 
and IG [10]. For an input pattern, we have to calculate the similarity between the input pattern and every 
existing cluster. Each pattern consists of p components where p is the number of classes in the document set. 
Therefore, in the worst case, the time complexity of our method is $O(mkp)$, where $m$ is the number of original 
features and $k$ is the number of clusters finally obtained. For DC, the complexity is $O(mkpt)$, where $t$ is the 
number of iterations to be done. The complexity of IG is $O(mp + m \log m)$, and the complexity of IOC is 
$O(mkpn)$, where $n$ is the number of documents involved. IG therefore has the lowest complexity, while our 
method has a lower complexity than DC and IOC. 

3.2 Feature Extraction 

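Formally, feature extraction can be expressed in the following form: 

$$D' = DT \tag{23}$$

where 

$$D = [d_1\; d_2\; \cdots\; d_n]^T \tag{24}$$

$$D' = [d'_1\; d'_2\; \cdots\; d'_n]^T \tag{25}$$

$$T = \begin{bmatrix} t_{11} & \cdots & t_{1k} \\ \vdots & \ddots & \vdots \\ t_{m1} & \cdots & t_{mk} \end{bmatrix} \tag{26}$$

with $d_i = [d_{i1}\; d_{i2}\; \cdots\; d_{im}]$ and $d'_i = [d'_{i1}\; d'_{i2}\; \cdots\; d'_{ik}]$ 
for $1 \le i \le n$. Clearly, $T$ is a weighting matrix. The goal of feature reduction is achieved by finding an 
appropriate $T$ such that $k$ is smaller than $m$. In the divisive information-theoretic feature clustering 
algorithm [27] described in Section 2.2, the elements of $T$ in (23) are binary and can be defined as follows: 

$$t_{ij} = \begin{cases} 1, & \text{if } w_i \in W_j, \\ 0, & \text{otherwise,} \end{cases} \tag{27}$$

where $1 \le i \le m$ and $1 \le j \le k$. That is, if a word $w_i$ belongs to cluster $W_j$, $t_{ij}$ is 1; 
otherwise $t_{ij}$ is 0. 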

By applying our clustering algorithm, word patterns have been grouped into clusters, and words in the 
feature vector W are also clustered accordingly. For one cluster, we have one extracted feature. Since we have k 
clusters, we have k extracted features. The elements of T are derived based on the obtained clusters, and feature 
extraction will be done. We propose three weighting approaches: hard, soft, and mixed. In the hard-weighting 
approach, each word is only allowed to belong to one cluster, and so it only contributes to one new extracted 
feature. In this case, the elements of $T$ in (23) are defined as follows: 

$$t_{ij} = \begin{cases} 1, & \text{if } j = \arg\max_{1 \le \alpha \le k} \mu_{G_\alpha}(x_i), \\ 0, & \text{otherwise.} \end{cases} \tag{28}$$

Note that if $j$ is not unique in (28), one of them is chosen randomly. In the soft-weighting approach, 
each word is allowed to contribute to all new extracted features, with the degrees depending on the values of the 
membership functions. The elements of $T$ in (23) are defined as follows: 

$$t_{ij} = \mu_{G_j}(x_i) \tag{29}$$

The mixed-weighting approach is a combination of the hard-weighting approach and the soft-weighting 
approach. For this case, the elements of $T$ in (23) are defined as follows: 

$$t_{ij} = \gamma \times t^{H}_{ij} + (1 - \gamma) \times t^{S}_{ij} \tag{30}$$

where $t^{H}_{ij}$ is obtained by (28) and $t^{S}_{ij}$ is obtained by (29), and $\gamma$ is a user-defined 
constant lying between 0 and 1. Note that $\gamma$ is not related to the clustering. It concerns the merge of 
component features in a cluster into a resulting feature. The merge can be "hard" or "soft" by setting $\gamma$ 
to 1 or 0. By selecting the value of $\gamma$, we provide flexibility to the user. When the similarity threshold 
is small, the number of clusters is small, and each cluster covers more training patterns. In this case, a smaller 
$\gamma$ will favor soft-weighting and get a higher accuracy. On the contrary, when the similarity threshold is 
large, the number of clusters is large, and each cluster covers fewer training patterns. In this case, a larger 
$\gamma$ will favor hard-weighting and get a higher accuracy. 
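
The three weighting schemes (28)-(30) and the transformation of a document can be sketched as follows. The function names and the small membership matrix are hypothetical; ties in the hard-weighting step are broken by taking the first maximum, whereas the text chooses one at random.

```python
def weighting_matrices(memberships, gamma):
    """Build T_H (28), T_S (29), and T_M (30) from memberships[i][j] = mu_Gj(x_i)."""
    t_hard, t_soft, t_mixed = [], [], []
    for row in memberships:
        # Ties are broken by taking the first maximum (the text picks one at random).
        best = max(range(len(row)), key=lambda j: row[j])
        hard = [1.0 if j == best else 0.0 for j in range(len(row))]
        soft = list(row)
        mixed = [gamma * h + (1.0 - gamma) * s for h, s in zip(hard, soft)]
        t_hard.append(hard)
        t_soft.append(soft)
        t_mixed.append(mixed)
    return t_hard, t_soft, t_mixed

def transform(doc, t):
    """d' = dT for one document given as a list of word counts."""
    return [sum(d_i * row[j] for d_i, row in zip(doc, t)) for j in range(len(t[0]))]

# Hypothetical memberships for three words and two clusters.
mu = [[0.9, 0.2],
      [0.1, 0.8],
      [0.6, 0.5]]
t_h, t_s, t_m = weighting_matrices(mu, gamma=0.8)
d_hard = transform([1, 2, 1], t_h)    # [2.0, 2.0]
d_soft = transform([1, 2, 1], t_s)
d_mixed = transform([1, 2, 1], t_m)
```

Because (30) is linear in the elements of $T$, the mixed-weighting result for a document is the same convex combination of its hard-weighting and soft-weighting results.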

3.3 Text Classification 

Given a set $D$ of training documents, text classification can be done as follows. We specify the 
similarity threshold $\rho$ for (14), and apply our clustering algorithm. Assume that $k$ clusters are obtained 
for the words in the feature vector $W$. Then we find the weighting matrix $T$ and convert $D$ to $D'$ by (23). 
Using $D'$ as training data, a classifier based on support vector machines (SVM) is built. Note that any 
classifying technique other than SVM can be applied. Joachims [36] showed that SVM is better than other methods 
for text categorization. SVM is a kernel method, which finds the maximum margin hyperplane in feature space 
separating the images of the training patterns into two groups [37], [38], [39]. To make the method more flexible 
and robust, some patterns need not be correctly classified by the hyperplane, but the misclassified patterns 
should be penalized. Therefore, slack variables $\xi_i$ are introduced to account for misclassifications. The 
objective function and constraints of the classification problem can be formulated as: 

$$\min \; \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i \tag{31}$$

$$\text{s.t.} \quad y_i \left(w^T \phi(x_i) + b\right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, 2, \ldots, l,$$

where $l$ is the number of training patterns, $C$ is a parameter, which gives a tradeoff between maximum 
margin and classification error, and $y_i$, being +1 or -1, is the target label of pattern $x_i$. Note that 
$\phi: X \to F$ is a mapping from the input space to the feature space $F$, where patterns are more easily 
separated, and $w^T \phi(x_i) + b = 0$ is the hyperplane to be derived, with $w$ and $b$ being the weight vector 
and offset, respectively. 
An SVM described above can only separate two classes, $y_i = +1$ and $y_i = -1$. We follow the idea in [36] 
to construct an SVM-based classifier. For $p$ classes, we create $p$ SVMs, one SVM for each class. For the SVM 
of class $c_v$, $1 \le v \le p$, the training patterns of class $c_v$ are treated as having $y_i = +1$, and the 
training patterns of the other classes are treated as having $y_i = -1$. The classifier is then the aggregation 
of these SVMs. Now we are ready for classifying unknown documents. Suppose $d$ is an unknown document. We first 
convert $d$ to $d'$ by 

$$d' = dT \tag{32}$$

Then we feed $d'$ to the classifier. We get $p$ values, one from each SVM. Then $d$ belongs to those 
classes with 1 appearing at the outputs of their corresponding SVMs. 

IV. AN EXAMPLE FOR PROPOSED METHOD 

We give an example here to illustrate how our method works. Let $D$ be a simple document set containing 9 
documents $d_1, d_2, \ldots, d_9$ of two classes $c_1$ and $c_2$, with 10 words "office," "building," ..., 
"fridge" in the feature vector $W$, as shown in Table 1. For simplicity, we denote the ten words as 
$w_1, w_2, \ldots, w_{10}$, respectively. 

We calculate the ten word patterns $x_1, x_2, \ldots, x_{10}$ according to (7) and (8), where 
$x_i = \langle P(c_1 \mid w_i), P(c_2 \mid w_i) \rangle$. The resulting word patterns are shown in Table 2. Note 
that each word pattern is a two-dimensional vector, since there are two classes involved in $D$. 

We run our self-constructing clustering algorithm, by setting $\sigma_0 = 0.5$ and $\rho = 0.64$, on the word 
patterns and obtain 3 clusters $G_1$, $G_2$, and $G_3$, which are shown in Table 3. The fuzzy similarity of each 
word pattern to each cluster is shown in Table 4. The weighting matrices $T_H$, $T_S$, and $T_M$ obtained by 
hard-weighting, soft-weighting, and mixed-weighting (with $\gamma = 0.8$), respectively, are shown in Table 5. The 
transformed data sets $D'_H$, $D'_S$, and $D'_M$ obtained by (23) for different cases of weighting are shown in 
Table 6. 

Based on $D'_H$, $D'_S$, and $D'_M$, a classifier with two SVMs is built. Suppose $d$ is an unknown document, and 
$d = \langle 0, 1, 1, 1, 1, 1, 0, 1, 1, 1 \rangle$. We first convert $d$ to $d'$ by (32). Then, the transformed 
document is obtained as $d'_H = dT_H = \langle 2, 4, 2 \rangle$, 
$d'_S = dT_S = \langle 2.5591, 4.3478, 3.9964 \rangle$, or $d'_M = dT_M = \langle 2.1118, 4.0696, 2.3993 \rangle$. 
Then the transformed unknown document is fed to the classifier. For this example, the classifier concludes that 
$d$ belongs to $c_2$. 
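
The transformed vectors quoted above can be checked from the tabulated values. The sketch below recomputes the memberships from the word patterns of Table 2 and the cluster parameters of Table 3, builds the three weighting matrices with $\gamma = 0.8$, and transforms the unknown document $d$. The variable and function names are hypothetical, and $x_8$ is taken as <2/3, 1/3> (an assumption; Table 2 prints 0.67 and 0.33). Because Table 3 lists rounded parameters, the soft and mixed results can be expected to agree with the quoted values only up to rounding.

```python
import math

# Word patterns from Table 2 (x_8 assumed to be <2/3, 1/3>).
patterns = [[0.0, 1.0], [0.2, 0.8], [0.2, 0.8], [0.0, 1.0], [1.0, 0.0],
            [0.5, 0.5], [1.0, 0.0], [2 / 3, 1 / 3], [0.0, 1.0], [1.0, 0.0]]
# Cluster means and deviations from Table 3 (G_1, G_2, G_3).
clusters = [([1.0, 0.0], [0.5, 0.5]),
            ([0.08, 0.92], [0.6095, 0.6095]),
            ([0.5833, 0.4167], [0.6179, 0.6179])]

def membership(x, mean, dev):
    """Fuzzy similarity (13), written as the exponential of a sum."""
    return math.exp(-sum(((a - m) / s) ** 2 for a, m, s in zip(x, mean, dev)))

def transform(doc, t):
    """d' = dT as in (32)."""
    return [sum(d_i * row[j] for d_i, row in zip(doc, t)) for j in range(len(t[0]))]

gamma = 0.8
t_soft = [[membership(x, m, s) for m, s in clusters] for x in patterns]      # (29)
t_hard = [[1.0 if v == max(row) else 0.0 for v in row] for row in t_soft]    # (28)
t_mixed = [[gamma * h + (1 - gamma) * s for h, s in zip(hr, sr)]             # (30)
           for hr, sr in zip(t_hard, t_soft)]

d = [0, 1, 1, 1, 1, 1, 0, 1, 1, 1]
d_hard = transform(d, t_hard)
d_soft = transform(d, t_soft)
d_mixed = transform(d, t_mixed)
print(d_hard, [round(v, 4) for v in d_soft], [round(v, 4) for v in d_mixed])
```

The hard-weighting result is exactly <2, 4, 2>, because words 5 and 10 of $d$ fall in $G_1$, words 2, 3, 4, and 9 in $G_2$, and words 6 and 8 in $G_3$.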

TABLE 1 A Simple Document Set D 

TABLE 2 Word Patterns of W 

              x_1    x_2    x_3    x_4    x_5    x_6    x_7    x_8    x_9    x_10
P(c_1|w_i)    0.00   0.20   0.20   0.00   1.00   0.50   1.00   0.67   0.00   1.00
P(c_2|w_i)    1.00   0.80   0.80   1.00   0.00   0.50   0.00   0.33   1.00   0.00


TABLE 3 Three Clusters Obtained 

Cluster   Size S   Mean m                  Deviation σ
G_1       3        < 1, 0 >                < 0.5, 0.5 >
G_2       5        < 0.08, 0.92 >          < 0.6095, 0.6095 >
G_3       2        < 0.5833, 0.4167 >      < 0.6179, 0.6179 >



TABLE 4 Fuzzy Similarities of Word Patterns to Three Clusters 

Similarity   x_1      x_2      x_3      x_4      x_5      x_6      x_7      x_8      x_9      x_10
μ_G1(x)      0.0003   0.0060   0.0060   0.0003   1.0000   0.1353   1.0000   0.411    0.0003   1.0000
μ_G2(x)      0.9661   0.9254   0.9254   0.9661   0.0105   0.3869   0.0105   0.1568   0.9661   0.0105
μ_G3(x)      0.1682   0.4631   0.4631   0.1682   0.4027   0.9643   0.4027   0.9643   0.1682   0.4027


TABLE 5 Weighting Matrices: Hard T_H, Soft T_S, and Mixed T_M 

TABLE 6 Transformed Data Sets: Hard D'_H, Soft D'_S, and Mixed D'_M 

V. CONCLUSION 

We have presented a fuzzy self-constructing feature clustering (FFC) algorithm, which is an 
incremental clustering approach to reduce the dimensionality of the features in text classification. Features that 
are similar to each other are grouped into the same cluster. Each cluster is characterized by a membership 
function with statistical mean and deviation. If a word is not similar to any existing cluster, a new cluster is 
created for this word. Similarity between a word and a cluster is defined by considering both the mean and the 
variance of the cluster. When all the words have been fed in, a desired number of clusters are formed 
automatically. We then have one extracted feature for each cluster. The extracted feature corresponding to a 
cluster is a weighted combination of the words contained in the cluster. By this algorithm, the derived 
membership functions match closely with and describe properly the real distribution of the training data. 
Besides, the user need not specify the number of extracted features in advance, and trial-and-error for 
determining the appropriate number of extracted features can then be avoided. 